Building Workflows That Are Repeatable and Never Lost
Understand idempotency and how it leads to repeatable workflows.
As DevOps engineers, we write tooling all the time. In small shops, this tooling is often a set of scripts; in large shops, it grows into complicated systems.
As the introduction suggested, tool execution should always occur in a centralized service, regardless of scale. A basic service is easy to write, and we can expand or replace it as new needs arise.
But to make a workflow service work, two key concepts must be true of the workflows we create, as follows:
They must be repeatable.
They cannot be lost.
The first concept is that running a workflow more than once on the same infrastructure should produce the same result. We call this idempotency, borrowing the computer science term.
The second is that a workflow cannot be lost. If a tool creates a workflow to be executed by a system and the tool dies, the tool must be able to know that the workflow is running and resume watching it.
Building idempotent workflows
Idempotency means that if we make a call with the same parameters multiple times, we receive the same result. This is an important property when writing certain types of software.
In infrastructure, we modify this definition slightly: an idempotent action is one that, if repeated with the same parameters and without changes to the infrastructure outside of this call, will return the same result.
Idempotency is key to making workflows that can be recovered when our workflow system goes down. Simple workflow systems can just repeat the entire workflow. More complicated systems can restart from where they left off.
Developers often don't think deeply about idempotency. For example, let's look at a simple operation to copy some content to a file. Here is a naive implementation:
The preceding code contains the following:

A content argument that represents the content for a file.
A p argument, which is the path to the file.

It also does the following:

Writes the content to the file at p.
This initially appears to be idempotent. If our workflow was killed after CopyToFile() was called but before os.WriteFile() completed, we could repeat this operation, and it initially looks as though calling it twice would produce the same result.
But what if the file didn't exist and we created it, yet we do not have permission to overwrite an existing file? If our program died after the write occurred but before recording the result of os.WriteFile(), repeating this action would report an error. The result differed between calls even though the infrastructure did not change outside of them, so the action is not idempotent.
Let's modify this to make it idempotent, as follows:
This code checks if the file exists and then does the following:
If it exists and it already has the content, it doesn't do anything.
If it doesn't, it writes the content.
This uses the standard library's crypto/sha256 package to calculate checksum hashes and verify whether the content is already the same.
The key to providing idempotency is often simply checking if the work is already done.
This leads us to a concept called three-way handshakes. This concept can be used in actions to provide idempotency when we need to talk to other systems via RPC. We will discuss how to use this concept in terms of executing workflows, but this can also be used in idempotent actions that talk to other services.
Using three-way handshakes to prevent workflow loss
When we write an application that talks to a workflow service, it is important that the application never loses track of workflows that are running on our service.
The three-way handshake is a name we borrowed from Transmission Control Protocol (TCP). TCP has a handshake that establishes a socket between two machines. It consists of the following:
SYNchronize (SYN), a request to open a connection.
SYN-ACK, an acknowledgment of the SYN that also opens the reverse direction.
ACKnowledge (ACK), an acknowledgment of the SYN-ACK.
When a client sends a request to execute a workflow, we never want the workflow service to execute a workflow that the client doesn't know exists due to a crash of the client.
This can happen because the client program crashes or the machine the client is running on fails. If we sent a workflow and the service began executing after a single RPC, the client could crash after sending the RPC but before receiving an identifier (ID) for the workflow.
This would lead to a scenario where, when the client restarted, it did not know the workflow service was already running the workflow, and it might send another workflow that did the same thing.
To avoid that, instead of a single RPC to execute a workflow, a workflow should have a three-way handshake to do the following:
Send the workflow to the service.
Receive the workflow ID.
Send a request to execute the workflow with its ID to the service.
This allows the client to record the ID of the workflow before it executes. If the client crashes before recording the ID, the service simply has a non-running workflow record. If the client dies after the service begins execution, when the client restarts, it can check the status of the workflow. If it is running, it can simply monitor it. If it isn't running, it can request it to execute again.
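The client-side ordering described above can be sketched in Go as follows. The `Client` interface, the `fakeClient` implementation, and the `recordID` callback are all hypothetical stand-ins (for the real gRPC client and a durable store); the point is that the ID is recorded before execution is requested:

```go
package main

import "fmt"

// Client is a hypothetical stand-in for the workflow service's client.
type Client interface {
	// Submit sends the workflow and returns its ID; nothing executes yet.
	Submit(work string) (id string, err error)
	// Exec asks the service to start executing a submitted workflow.
	Exec(id string) error
	// Status reports whether the workflow with the given ID is running.
	Status(id string) (running bool, err error)
}

// RunWorkflow performs the three-way handshake: submit, durably record
// the ID, then execute. recordID stands in for durable storage, so a
// restarted client can always find the IDs of in-flight workflows.
func RunWorkflow(c Client, work string, recordID func(string) error) (string, error) {
	id, err := c.Submit(work) // 1. Send the workflow to the service.
	if err != nil {
		return "", err
	}
	if err := recordID(id); err != nil { // 2. Record the ID before executing.
		return "", err
	}
	if err := c.Exec(id); err != nil { // 3. Request execution by ID.
		return "", err
	}
	return id, nil
}

// fakeClient is an in-memory implementation used only for illustration.
type fakeClient struct {
	nextID  int
	running map[string]bool
}

func (f *fakeClient) Submit(work string) (string, error) {
	f.nextID++
	id := fmt.Sprintf("wf-%d", f.nextID)
	f.running[id] = false // stored, but not executing
	return id, nil
}

func (f *fakeClient) Exec(id string) error {
	f.running[id] = true
	return nil
}

func (f *fakeClient) Status(id string) (bool, error) {
	return f.running[id], nil
}

func main() {
	c := &fakeClient{running: map[string]bool{}}
	var recorded string
	id, err := RunWorkflow(c, "upgrade cluster", func(id string) error {
		recorded = id // in real code this would be durable storage
		return nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(id == recorded) // the ID was known before execution began
}
```

If the client crashes between steps 1 and 2, the service holds only a non-running record; if it crashes after step 3, the recorded ID lets the restarted client call Status and resume monitoring.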
For our workflow service, let's create a service definition that supports our three-way handshake using gRPC, as follows:
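A sketch of such a protocol buffers definition might look like the following; the service name and the request/response message names (Workflow, WorkReq, ExecReq, StatusReq, and their responses) are illustrative assumptions:

```protobuf
syntax = "proto3";

package workflow;

// Workflow is the service clients use to submit, execute, and
// monitor workflows via a three-way handshake.
service Workflow {
  // Submit stores a WorkReq and returns its ID; nothing executes yet.
  rpc Submit(WorkReq) returns (WorkResp) {};
  // Exec starts execution of a previously submitted WorkReq by ID.
  rpc Exec(ExecReq) returns (ExecResp) {};
  // Status reports the current state of a WorkReq by ID.
  rpc Status(StatusReq) returns (StatusResp) {};
}
```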
This defines a service with the following calls:
Submit submits a WorkReq message that describes the work to be done.
Exec executes a WorkReq previously sent to the server with Submit.
Status retrieves the status of a WorkReq.
The content of the messages for these service calls will be discussed in detail in the next section, but the key to this is that on Submit(), WorkResp will return an ID, but the workflow will not execute. When Exec() is called, we will send the ID we received from our Submit() call, and our Status() call allows us to check the status of any workflow.
We now have the basic definition of a workflow service that includes a three-way handshake to prevent any loss of workflows by our clients.
In this lesson, we have covered the basics of repeatable workflows that cannot be lost by our clients. We discussed idempotency and how it leads to repeatable workflows, and we showed how a three-way handshake prevents a running workflow from becoming lost.
We have also defined service calls that we will use in the workflow system we are building.
Now, we want to look at how tools can understand the scope of work (SOW) being executed to provide protection against runaway tooling.